Fix DeployBot recovery deadlock from stale re-pause #48
Merged
cursor[bot] merged 1 commit on Jun 22, 2026
A failed exact-main release that an operator explicitly unpaused could never land its repair. The original CI failure lingers on the failed SHA until the repair merges, so any coordinator run (especially a workflow pinned to an older release) rereads that result, writes a fresh, byte-identical pause, and overwrites the running recovery. The repair then sits behind a release that can only turn green once the repair merges.
Three coupled fixes break the deadlock:
- records.latest_control: a recovery now carries the exact SHA and reason it resumed, and an unconditional pause that merely restates that same already-recovered failure is ignored. A genuinely new failure (a different SHA, or a different reason such as a later deploy failure) still pauses normally, so concurrent-pause ownership is preserved.
- command_react: while a recovery owns the current failed main, the release-admission fence no longer holds that SHA. The elected repair drains and advances main past the failed revision; the new main is then followed and verified normally.
- command_unpause: after recording the durable recovery, DeployBot reacts immediately so the repair merges without waiting for the next delivery event or the five-minute reconciliation sweep. --no-wake opts out, and --follow/--dispatch-ci/--timeout shape the wake-up reaction. The recovery is durable, so a transient wake-up error is reported but never re-pauses the pipeline.
Adds regressions covering the stale re-pause race, the reactor merging a repair during recovery, the scoping of that bypass, and the unpause wake.
Co-authored-by: mberman84 <mberman84@users.noreply.github.com>
Problem
A failed exact-main release that an operator explicitly unpaused could never land its repair, deadlocking delivery.
Reproduction (from the report):
79cbedf6…. 86152d68…. deploybot unpause created a running recovery control. release-held, and wrote a new ci-failed pause.
The root cause is that the original CI failure lingers on the failed SHA until the repair merges. Any coordinator run (especially a workflow pinned to an older DeployBot release) rereads that same result, writes a fresh, byte-identical pause, and overwrites the running recovery. The release-admission fence also keeps holding that failed SHA, so the repair never drains.
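The overwrite described above can be illustrated with a small model. This is only a sketch under stated assumptions: the `Control` record, its field names, `naive_latest_control`, `coordinator_pause_from_ci`, and the placeholder SHA are all hypothetical and are not taken from the DeployBot source. It assumes the pre-fix reconciliation behaves like "newest write wins".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Control:
    """Hypothetical control record; field names are illustrative only."""
    state: str   # "paused" or "running"
    sha: str     # main revision the record refers to
    reason: str  # e.g. "ci-failed"


def naive_latest_control(current: Optional[Control], incoming: Control) -> Control:
    """Assumed pre-fix behaviour: the newest write always replaces the record."""
    return incoming


def coordinator_pause_from_ci(failed_sha: str) -> Control:
    """Every coordinator run rereads the lingering CI result and emits the same pause."""
    return Control(state="paused", sha=failed_sha, reason="ci-failed")


FAILED_SHA = "failed-main-sha"  # placeholder, not a real revision

# 1. The original failure pauses the pipeline.
control = coordinator_pause_from_ci(FAILED_SHA)

# 2. The operator unpauses, which records a running recovery.
recovery = Control(state="running", sha=FAILED_SHA, reason="ci-failed")
control = naive_latest_control(control, recovery)

# 3. A lagging coordinator run restates the old failure and clobbers the recovery.
control = naive_latest_control(control, coordinator_pause_from_ci(FAILED_SHA))

print(recovery.state, "->", control.state)  # running -> paused
```

Because the CI result on the failed SHA cannot change until the repair merges, step 3 can repeat indefinitely in this model, which is the deadlock.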
Fix
Three coupled changes break the deadlock and make recovery robust against stale reconciliation:
Preserve the recovery (records reconciliation). A recovery now remembers the exact SHA and reason it resumed. An unconditional pause that merely restates that already-recovered failure is ignored, so a lagging or older-version worker can no longer clobber the recovery. A genuinely new failure (a different SHA, or a different reason such as a later deploy failure) still pauses normally, preserving concurrent-pause ownership.
Merge the repair (reactor admission fence). While a recovery owns the current failed main, the release-admission fence no longer holds that SHA. The elected repair drains and advances main past the failed revision, and the new main is then followed and verified normally.
Wake immediately (unpause). After recording the durable recovery, DeployBot reacts right away so the repair merges without waiting for the next delivery event or the five-minute reconciliation sweep. --no-wake opts out, and --follow/--dispatch-ci/--timeout shape that wake-up reaction. The recovery is already durable, so a transient wake-up error is reported but never re-pauses the pipeline.
Tests
New regressions cover the stale re-pause race, a genuinely new failure still pausing, the reactor merging a repair during recovery (and the bypass staying scoped to the exact unpaused SHA), and the unpause wake (including opt-out and durable-recovery-survives-wake-failure).
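The first two fixes, and the kind of checks these regressions make, can be modelled roughly as follows. This is a minimal sketch, not the actual implementation: the `Control` record, the simplified `latest_control` signature, `fence_holds`, the placeholder SHAs, and the "deploy-failed" reason string are assumptions made for illustration. A running record here stands for a recovery and carries the SHA and reason it resumed from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Control:
    """Hypothetical control record; a "running" record models a recovery."""
    state: str   # "paused" or "running"
    sha: str     # for a recovery: the exact SHA it resumed from
    reason: str  # for a recovery: the exact reason it resumed from


def latest_control(current: Optional[Control], incoming: Control) -> Control:
    """Ignore a pause that only restates the failure a running recovery resumed from."""
    if (
        current is not None
        and current.state == "running"
        and incoming.state == "paused"
        and (incoming.sha, incoming.reason) == (current.sha, current.reason)
    ):
        return current  # stale re-pause: the recovery keeps ownership
    return incoming     # anything else, including a new failure, wins as before


def fence_holds(main_sha: str, main_failed: bool, control: Optional[Control]) -> bool:
    """Hold a failed main unless a recovery owns that exact SHA."""
    if not main_failed:
        return False
    recovery_owns_main = (
        control is not None
        and control.state == "running"
        and control.sha == main_sha
    )
    return not recovery_owns_main


recovery = Control("running", "sha-a", "ci-failed")
stale = Control("paused", "sha-a", "ci-failed")          # same SHA, same reason
new_sha = Control("paused", "sha-b", "ci-failed")        # different SHA
new_reason = Control("paused", "sha-a", "deploy-failed")  # hypothetical later failure

assert latest_control(recovery, stale) is recovery        # stale re-pause ignored
assert latest_control(recovery, new_sha) is new_sha       # new failure still pauses
assert latest_control(recovery, new_reason) is new_reason
assert fence_holds("sha-a", True, recovery) is False      # repair may drain
assert fence_holds("sha-b", True, recovery) is True       # bypass scoped to exact SHA
```

Matching on both SHA and reason is what keeps the bypass narrow in this model: only a byte-for-byte restatement of the recovered failure is dropped.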
All existing behavior is preserved. CI parity verified locally: ruff check src tests, python -m unittest discover -s tests (251 tests OK), and python -m build all pass.
Note on versioning
The runtime pin (RELEASE_COMMIT / vX.Y.Z references in README, action, and client configs) points at a published release commit, so it is intentionally left to the separate release-pin step once this fix has a merge commit. The code fix itself is version-independent and robust even during a rolling upgrade.